Advanced Text Analysis

SICSS-Munich, Day 4


Session 2️⃣: Going beyond bag-of-words: An introduction

Valerie Hase (LMU Munich)

github.com/valeriehase

valerie-hase.com

Agenda

  • Limitations of bag-of-words (bow) approaches
  • Identifying meaning through ngrams
    • Keywords-in-context
    • Collocations
  • Identifying meaning through syntax
    • Part-of-speech tagging
    • Dependency parsing

The “bag-of-words” (bow) assumption

Likely ❌ wrong assumption that:

  • We “treat every word as having a distinct, unique meaning” (Grimmer et al., 2022, p. 79)
  • We can represent text “as if it were a bag of words, that is, an unordered set of words with their position ignored, keeping only their frequency in the document.” (Jurafsky & Martin, 2023, p. 60)
  • In short: Assumption that we can ignore the context of words and still understand their meaning.

Repetition: “bag-of-words” (bow)

  • Disassembling texts into tokens is the foundation of the bag-of-words (bow) model

  • Bow as a simplified representation of text in which only token frequencies are considered

Note. Figure from Jurafsky & Martin (2023, p. 60).

Repetition: Document-Feature Matrix in R

  • This assumption is best illustrated by analyses based on the Document-Feature Matrix (DFM).
  • In DFM-based approaches, context is ignored (unless you explicitly include, e.g., ngrams as features).
Code
library("quanteda")
library("tidyverse")
sentences <- c("I like programming", "I do not like programming")
sentences %>% 
  tokens() %>% 
  dfm()
Document-feature matrix of: 2 documents, 5 features (20.00% sparse) and 0 docvars.
       features
docs    i like programming do not
  text1 1    1           1  0   0
  text2 1    1           1  1   1

Can you come up with examples for when this assumption is violated? 🤔

Bag-of-words: A valid assumption?

Likely ❌ violated / not helpful when dealing with…

  • Polysemy: “I love this sound.” vs. “Sound solution!”
  • Negation: “Not bad!”
  • Named Entities: “United States”, “Olaf Scholz”
  • Features with similar meanings: “I like greens.” vs. “I like vegetables.”
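The negation case is easy to demonstrate. A minimal sketch (not from the slides): two sentences with opposite meanings end up with identical DFM rows, because word order is discarded.

```r
library("quanteda")

# Opposite meanings, identical bag-of-words:
# word order is discarded, so the negation is lost
c("not bad, quite good", "not good, quite bad") %>%
  tokens(remove_punct = TRUE) %>%
  dfm()
```

Both documents receive the same counts for the features not, bad, quite, and good.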

Have you learned about any methods that relax/do not rely on the bag-of-word assumption? 🤔

Going beyond bag-of-words

  • Identifying meaning through ngrams (Session 2️⃣)
  • Identifying meaning through syntax (Session 2️⃣)
  • Identifying meaning through semantic spaces (Session 3️⃣, 4️⃣)

First dataset for today

  • We’ll use data provided by the quanteda.corpora package (install directly from GitHub using devtools)
  • US State of the Union addresses from 1790 to present
  • Corpus contains N = 241 speeches
Code
library("devtools")
devtools::install_github("quanteda/quanteda.corpora")
library("quanteda.corpora")
corpus_sotu <- data_corpus_sotu

Identifying meaning through ngrams in R

  • ngram: sequence of n successive features in a corpus

    • Bigram: “that is”
    • Trigram: “that is great”
    • etc.
  • Let’s check out examples from our corpus:

    Code
    tokens(corpus_sotu) %>%
      tokens_ngrams() %>%
      head(1)
    Tokens consisting of 1 document and 6 docvars.
    Washington-1790 :
     [1] "Fellow-Citizens_of" "of_the"             "the_Senate"        
     [4] "Senate_and"         "and_House"          "House_of"          
     [7] "of_Representatives" "Representatives_:"  ":_I"               
    [10] "I_embrace"          "embrace_with"       "with_great"        
    [ ... and 1,154 more ]
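By default, tokens_ngrams() produces bigrams; longer sequences can be requested via the n argument. A quick sketch for trigrams:

```r
library("quanteda")

# Trigrams instead of the default bigrams
tokens(corpus_sotu) %>%
  tokens_ngrams(n = 3) %>%
  head(1)
```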

Repetition: Keywords-in-Context in R

  • Keywords-in-context (KWIC) as a way of displaying concordances, i.e., specific features shown in their surrounding context (a type of ngram).

  • Let’s remember how they work:

Code
library("quanteda.textstats")
corpus_sotu %>%
  tokens() %>%
  kwic(pattern = c("United"),
       window = 3) %>%
  head(3)
Keyword-in-context with 3 matches.                                                                          
  [Washington-1790, 49] Constitution of the | United | States ( of        
 [Washington-1790, 428]    interests of the | United | States require that
 [Washington-1790, 559]     measures of the | United | States is an       

Co-Occurrence Matrix in R

  • Columns & rows denote features
  • Cells indicate how often a feature co-occurs with another feature in the same document
  • Upper-triangular by default: cells below the diagonal are left sparse (i.e., 0)
Code
corpus_sotu %>%
  tokens() %>%
  dfm() %>%
  fcm() %>%
  head(2)
Feature co-occurrence matrix of: 2 by 35,263 features.
                 features
features          fellow-citizens       of       the senate      and  house
  fellow-citizens             126    82580    129044    514    45185    443
  of                            0 38848591 121575105 476874 45895049 351852
                 features
features          representatives      :       i embrace
  fellow-citizens             636    415    4760      53
  of                       436949 915480 6438506   32278
[ reached max_nfeat ... 35,253 more features ]

Repetition: Collocations in R

  • Collocations as sequences of features that carry shared semantic meaning and often co-occur, e.g., “United States”
  • Indicated by co-occurrence of these features in similar contexts (document, sentence)
Code
corpus_sotu %>% 
  textstat_collocations(min_count = 500) %>% 
  arrange(-lambda) %>%
  head(3)
      collocation count count_nested length    lambda         z
146 great britain   518            0      2 10.091406  36.52110
7   united states  4820            0      2  9.417612 155.90572
114          i am   988            0      2  8.755190  46.01525
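Once detected, collocations can be fed back into preprocessing: tokens_compound() merges them into single tokens, so that, e.g., “United_States” survives as one feature in the DFM. A sketch using the same corpus (not part of the original slides):

```r
library("quanteda")
library("quanteda.textstats")

# Detect frequent collocations, then merge them into single tokens
toks <- tokens(corpus_sotu)
collocations <- textstat_collocations(toks, min_count = 500)

toks %>%
  tokens_compound(pattern = collocations) %>%
  dfm() %>%
  head(1)
```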

How could we use these methods for social science questions? 🤔

Identifying meaning through ngrams: Overview 📚

  • Methods: Keywords-in-context, collocations, n-gram shingling (not discussed here)

  • Use for: Detecting text similarities, text reuse, stereotypical associations

  • Exemplary studies:

    • for collocations: Arendt & Karadas (2017)
    • for n-gram shingling: Nicholls (2019)
  • Tutorials: Puschmann & Haim (2019), Schweinberger (2023a), Watanabe & Müller (2023)

  • Packages: quanteda, textreuse and related publication (Mullen, 2020)

Going beyond bag-of-words

  • Identifying meaning through ngrams (Session 2️⃣)
  • Identifying meaning through syntax (Session 2️⃣)
  • Identifying meaning through semantic spaces (Session 3️⃣, 4️⃣)

Identifying meaning through syntax

  • We can also rely on information provided by syntax to better identify the meaning of language

  • Here, we will focus on two approaches:

    • Part-of-speech tagging
    • Dependency parsing

Part-of-Speech Tagging (PoS): Introduction


Image of a PoS-tagged sentence

Note. Figure from Jurafsky & Martin (2023, p. 164).

For explanation of tags, see De Marneffe et al. (2021).

Part-of-Speech Tagging in R

  • In R, usually via the spacyr package (but it requires Python, and installation is somewhat complicated)
  • For simplicity, here via udpipe package
  • But check out the comparison between both here and here
Code
library("udpipe")
corpus_sotu %>%
  
  #change format for udpipe package
  as_tibble() %>%
  mutate(doc_id = paste0("text", 1:n())) %>%
  rename(text = value) %>%
  
  #for simplicity, run for fewer documents
  slice_head(n = 1) %>%
  
  #part-of-speech tagging, include only related variables
  udpipe("english") %>% 
  select(doc_id, sentence_id, token_id, token, upos) %>%
  head(5)
  doc_id sentence_id token_id    token  upos
1  text1           1        1   Fellow   ADJ
2  text1           1        2        - PUNCT
3  text1           1        3 Citizens  NOUN
4  text1           1        4       of   ADP
5  text1           1        5      the   DET
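A common downstream use of PoS tags (a sketch, not part of the slides): restrict a frequency analysis to content words such as nouns and adjectives.

```r
library("udpipe")
library("tidyverse")

# Keep only nouns and adjectives before counting token frequencies
udpipe("I embrace with great satisfaction the opportunity", "english") %>%
  filter(upos %in% c("NOUN", "ADJ")) %>%
  count(token, sort = TRUE)
```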

Dependency parsing: Introduction

  • Dependency parsing: describing “the syntactic structure of a sentence […] in terms of directed binary grammatical relations between the words” (Jurafsky & Martin, 2023, p. 381)
  • Define syntactic meaning of features by relation to “root”
  • Use as semantic proxy
Image of a Dependency Tree

Note. Figure from Jurafsky & Martin (2023, p. 381).

For explanation of tags, see De Marneffe et al. (2021).

Dependency parsing in R

  • In R, usually via the spacyr package (but requires Python)
  • For simplicity, here via udpipe package
Code
library("udpipe")
corpus_sotu %>%
  
  #change format for udpipe package
  as_tibble() %>%
  mutate(doc_id = paste0("text", 1:n())) %>%
  rename(text = value) %>%
  
  #for simplicity, run for fewer documents
  slice_head(n = 1) %>%
  
  #dependency parsing, include only related variables
  udpipe("english") %>% 
  select(doc_id, sentence_id, token_id, token, head_token_id, dep_rel) %>%
  head(5)
  doc_id sentence_id token_id    token head_token_id dep_rel
1  text1           1        1   Fellow             3    amod
2  text1           1        2        -             3   punct
3  text1           1        3 Citizens             0    root
4  text1           1        4       of             6    case
5  text1           1        5      the             6     det
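As a sketch of the “semantic proxy” idea (not part of the original slides): dependency relations let us pull out, e.g., each sentence’s nominal subject (dep_rel == "nsubj"), a building block for detecting actors or sources.

```r
library("udpipe")
library("tidyverse")

# Extract each sentence's nominal subject via its dependency relation
udpipe("The Senate passed the bill", "english") %>%
  filter(dep_rel == "nsubj") %>%
  select(doc_id, sentence_id, token, dep_rel)
```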

Dependency parsing in R

  • Using the rsyntax package, we can even plot this to better understand these relations!
Code
library("rsyntax")
udpipe("My only goal in life is to understand dependency parsing", "english") %>%
  as_tokenindex() %>%
  plot_tree(token, lemma, upos)
Image of a Dependency Tree

How could we use these methods for social science questions? 🤔

Identifying meaning through syntax: Overview 📚

  • Methods: Part-of-speech tagging, dependency parsing

  • Use for: Detecting entities, entity-specific sentiment, sources, etc.

  • Exemplary studies:

    • for PoS: Burggraaff & Trilling (2020)
    • for dependency parsing: Van Atteveldt et al. (2017), Fogel-Dror et al. (2019)
  • Tutorials: Benoit & Matsuo (2020), Schweinberger (2023b)

  • Packages: spacyr, udpipe, rsyntax and related publication (Welbers et al., 2021)

Any questions? 🤔

References

Arendt, F., & Karadas, N. (2017). Content Analysis of Mediated Associations: An Automated Text-Analytic Approach. Communication Methods and Measures, 11(2), 105–120. https://doi.org/10.1080/19312458.2016.1276894
Benoit, K., & Matsuo, A. (2020). A Guide to Using Spacyr. https://spacyr.quanteda.io/articles/using_spacyr.html
Burggraaff, C., & Trilling, D. (2020). Through a different gate: An automated content analysis of how online news and print news differ. Journalism, 21(1), 112–129. https://doi.org/10.1177/1464884917716699
De Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 1–54. https://doi.org/10.1162/coli_a_00402
Fogel-Dror, Y., Shenhav, S. R., Sheafer, T., & Van Atteveldt, W. (2019). Role-based Association of Verbs, Actions, and Sentiments with Entities in Political Discourse. Communication Methods and Measures, 13(2), 69–82. https://doi.org/10.1080/19312458.2018.1536973
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf
Mullen, L. (2020). Textreuse: Detect text reuse and document similarity.
Nicholls, T. (2019). Detecting Textual Reuse in News Stories, At Scale. International Journal of Communication, 13, 4173–4197.
Puschmann, C., & Haim, M. (2019). Automated Content Analysis with R. https://content-analysis-with-r.com/
Schweinberger, M. (2023a). Analyzing co-occurrences and collocations in R. The University of Queensland, School of Languages and Cultures.
Schweinberger, M. (2023b). Part-of-speech tagging and dependency parsing with R. The University of Queensland, School of Languages and Cultures.
Van Atteveldt, W., Sheafer, T., Shenhav, S. R., & Fogel-Dror, Y. (2017). Clause Analysis: Using Syntactic Information to Automatically Extract Source, Subject, and Predicate from Texts with an Application to the 2008–2009 Gaza War. Political Analysis, 25(2), 207–222. https://doi.org/10.1017/pan.2016.12
Watanabe, K., & Müller, S. (2023). Quanteda Tutorials. https://tutorials.quanteda.io/
Welbers, K., Van Atteveldt, W., & Kleinnijenhuis, J. (2021). Extracting semantic relations using syntax: An R package for querying and reshaping dependency trees. Computational Communication Research, 3(2), 1–16. https://doi.org/10.5117/CCR2021.2.003.WELB